(54) Optical character recognition system having context analyzer



(57) An optical character recognition (OCR) system (2) is provided, in which syntactical and
semantic rules (6), provided along with an input image (4) to be scanned and applicable to the
contents of the scanned image, are used in connection with the results of the OCR scan to
identify the scanned characters. As a result, the recognition rate and confidence are enhanced.
By providing the checking based on syntactical and semantic rules within the OCR system (2),
application programs (8) which would receive and use the OCR results are freed from the added
burden of having to perform their own syntactical and/or semantic checking on the OCR results
the application programs (8) receive from the OCR system (2).
Description



The process of OCR is prone to errors. [...] for segmenting a connected component, and several
[...] alternatives [...]

[...] or "specification," written in the Document Specification Language (DSL) [...] the syntax
and semantics of the information to be recognized [...]

While the invention is primarily disclosed as a system, it will be understood [...] including a
CPU, memory, I/O, program storage, a connecting bus, and other appropriate components [...]
apparatus and articles of manufacture also fall within the [...]
FIG. 2 is a table of OCR results from scanning an input image of the character sequence 2/5/94, each column of
the table corresponding with one of the connected components (94 being a single connected component), and each
tabulated item representing a recognized character sequence and a corresponding probability-of-accurate-recognition
measurement.

FIG. 3 is a table of possible character types (n representing a numeric, and - and / respectively representing a dash
and a slash) produced from the tabulated items in the OCR scan results table of FIG. 2.
FIG. 4 is a table of character choices for the model 2/5/94, based on the OCR results given in FIG. 2.
FIG. 5 is a table of character choices for a scan of the city name FRESNO, which satisfied the syntax phase of the
recognition method of the invention, and are to be used in the semantics phase thereof.

With reference now to the figures, FIG. 1 shows the general architecture of an OCR system in accordance with the
invention, and its operating environment. The OCR system is generally shown as 2. The OCR system 2 receives as
inputs an input image 4, containing an image of text to be recognized, and a document description 6. The image may
be an array of image pixels produced by an image scanner in known fashion, or some other suitable representation of
the character image. The document description 6 is preferably provided using the DSL, which will be described in detail
below.

The document description is part of a predetermined text content constraint. More specifically, it includes syntax
and semantics information with which the information in the input image is expected to conform. The document
description is used to improve the accuracy of the OCR system's output in a manner to be described below.
The OCR system 2 performs optical character recognition on the input image 4, using the document description 6
to improve the accuracy of recognition, and provides the resultant recognized text as an output to an application program
8. The nature of the application program 8 is not essential to the invention. However, through the use of an OCR system
in accordance with the invention, the application program 8 is relieved of the need to perform context-based checking
on the OCR results, thereby saving processing time as well as the application program designer's programming time.
Additionally, the application program is less expensive to purchase or develop, because its capabilities need not include
context-based checking of the results of the OCR.

The system of FIG. 1 operates essentially as follows. Through a process of compilation, the document
description 6 is compiled to produce a context structure 10, which is usable by the OCR system 2 in character recognition.
The application program 8 invokes a context analyzer 12, within the OCR system 2, and directs the context analyzer 12
to access the context structure 10 and the input image 4.

The context analyzer 12 invokes a recognition engine 14, which may be any suitable device for performing OCR,
such as a neural network, image processor, etc. Such recognition engines, which are known in the art, produce tokens
which identify characters recognized (at least tentatively) in the scanned input image, and can also include confidence
or probability measurements corresponding with the recognized characters.
The context analyzer 12 and the recognition engine 14 communicate through an object oriented buffer 16. The
objects in the buffer are all character string variables identified by name.

Also, a set of semantics routines 18 is provided, supplying additional predetermined text content constraints which are
used to increase the recognition accuracy. (Preferably, the semantics routines 18 include a suitable mechanism
to include new user-defined routines. Many such mechanisms, such as system editors, are known.) The context analyzer
12 also accesses the semantic routines 18 through the object oriented buffer 16.

At execution time, the context analyzer 12 behaves like a parser. The recognition engine 14 produces tokens. The
context analyzer 12 tries to match the tokens with the syntax specification expressed by the context structure 10, and
makes sure that semantic constraints given in the semantic routines 18 are not violated. Since there may be several
choices for a character (that is, a given input image might be recognized as either a 0, an O, or a Q), the parsing generally
involves backtracking (known to those skilled in the art of OCR). When the recognition process completes, the application
program 8 retrieves the recognized information from the object buffer 16.

At this point it is useful to include the following two definitions, which are taken from the Webster's II New Riverside
Dictionary (New York: Berkley Publishing Group, 1984), pp. 628, 698:

SYNTAX: the way in which words are put together to form sentences, clauses, and phrases.
SEMANTICS: 1. the study of meaning in language, esp. with regard to historical changes. 2. the study of the
relationships between signs and symbols.

As will be seen from the discussion which follows, the syntax of the text being recognized, i.e., the sequence of
alphas, numerics, punctuation marks, etc., will be used in a first phase of checking of the OCR results. The semantics,
i.e., the existence of a recognized name within a dictionary of known valid names, the consistency between city name
and ZIP code within that city, etc., will be used in a second phase of checking.

The details of the DSL are not essential to the invention, but are exemplified in a preferred implementation
as follows. A DSL program has one FIELD statement for each field. A field is defined as a portion of the text which is
expected to be found within a scanned text image, the portion having a known continuity. For instance, if addresses are
to be scanned, the ZIP code would be one field of the address, and the continuity is the fact that five numeric digits make
up a ZIP code. A FIELD statement has the following format:

FIELD field_name field_type field_coordinates
where the field name is a character string, the field type is the name of the type, and the field coordinates are
four integers which delimit dimensions within the scanned image.

Here is an example of a FIELD statement for a certain field w1 which contains a ZIP code:
FIELD w1 ZIP 124, 1200, 150, 1400
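The FIELD format can be illustrated with a small parser. The following is only a sketch in Python: the `Field` class and `parse_field` helper are hypothetical names introduced here, and the sketch assumes the four coordinates are separated by commas or spaces. It does not interpret what each coordinate means, since the text only says they delimit dimensions within the image.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """One FIELD statement: a name, a type name, and four integer coordinates."""
    name: str
    type_name: str
    coords: tuple


def parse_field(statement: str) -> Field:
    """Parse 'FIELD name type c1, c2, c3, c4' (hypothetical helper)."""
    # Treat commas as separators so 'w1, ZIP, ...' and 'w1 ZIP ...' both work.
    tokens = statement.replace(",", " ").split()
    if len(tokens) != 7 or tokens[0] != "FIELD":
        raise ValueError(f"not a FIELD statement: {statement!r}")
    coords = tuple(int(t) for t in tokens[3:])
    return Field(tokens[1], tokens[2], coords)


zip_field = parse_field("FIELD w1 ZIP 124, 1200, 150, 1400")
print(zip_field.name, zip_field.type_name, zip_field.coords)
```

A compiler for the document description could collect such records, one per field, before resolving each `type_name` against the type definitions.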

TYPE INFORMATION 

The type information, such as the ZIP code in the above example, must be known to the OCR system. It can be a
basic alphabet, an elementary type, or a composite type. Elementary and composite types may exist
in a predefined library, or be defined by the user, using the following facilities.

A basic alphabet is one of the following: NUM, UPPER_CASE, LOWER_CASE, etc.

ALPHABET hexadec (NUM, "ABCDEF")
ALPHABET octal ("01234567")

The first of these two examples includes all sixteen hexadecimal digits.

ELEMENTARY and COMPOSITE types may be easily understood from their relationship to each other, elementary
types generally being subsets of composite types. Each will be discussed separately.

First, consider several examples of elementary types. There are no two ways or formats of writing a five-digit ZIP
code such as 95120. Likewise, there are no two ways of writing [...] alone, such as the digits of an area code (such
as 408), the prefix of a telephone number (such as 927), or the extension of a telephone number
(such as 3999). By contrast, now let us consider a few examples of composite types. There are several ways of writing
a full telephone number, such as (408) 927-3999. For example, 408 [...]
The strings 95120, 408, 927, 3999 can be seen as instances of elementary types, while the complete telephone
number is an example of a composite type. An elementary type is defined by an ELEM_TYPE statement in the DSL
program. Such a statement has the following format:

ELEM_TYPE type_name PHRASE or WORD name LENGTH COND

PHRASE and WORD are key words that indicate whether or not spaces are allowed; name specifies the alphabet [...]

Optionally, the name of a dictionary or routine may also be added at the end of the statement.

For example:

ELEM_TYPE ZIP WORD, NUM, LENGTH=5

Each composite type is defined by a TYPE statement in the DSL program. It has the following format:
TYPE name pairs list output

where pairs [...] format: element name (element type) [...] set of elements that
compose this composite type, list is a list of acceptable representation(s), and output is the representation to be used
[...] representations, but for particular applications, some other suitable representation may be used as the output
representation.

[...] The representation is a sequence of element names and/or string constants. For example, the definition

ELEM_TYPE area WORD, NUM, LENGTH=3
ELEM_TYPE prefix WORD, NUM, LENGTH=3
ELEM_TYPE ext WORD, NUM, LENGTH=4
TYPE phone a(area), p(prefix), e(ext)
REP a ")" p e
REP a p e
REP "(" a ")" p e
OUTPUT "(" a ") " p e



Note that the REP statements given in the composite TYPE statement are preferably ordered by likelihood of
occurrence of the representations. The OUTPUT representation will be the one that the output string will have,
independently of how the information was initially written. Thus, by having several different REP statements geared to
recognizing different ways in which the composite type may be expressed, and a single OUTPUT statement, all strings
recognized, regardless of format, will be output in the same format.
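The interplay of several REP statements and one OUTPUT statement can be sketched as follows. This is an illustration in Python, not the DSL compiler itself: the three representations, the output format with a dash before the extension, and the helper names `rep_to_regex` and `normalize` are all assumptions made for the example.

```python
import re

# Element name -> number of digits (mirrors area, prefix and ext above).
ELEMENTS = {"a": 3, "p": 3, "e": 4}

# Representations, most likely first. Each item is an element name or a
# literal string constant. The exact literals are assumed for illustration.
REPS = [
    ["(", "a", ") ", "p", "-", "e"],
    ["a", "-", "p", "-", "e"],
    ["a", "p", "e"],
]
OUTPUT = ["(", "a", ") ", "p", "-", "e"]


def rep_to_regex(rep):
    """Turn one representation into a regex with a named group per element."""
    parts = []
    for item in rep:
        if item in ELEMENTS:
            parts.append(f"(?P<{item}>\\d{{{ELEMENTS[item]}}})")
        else:
            parts.append(re.escape(item))
    return re.compile("".join(parts))


def normalize(text):
    """Match text against each REP in order; emit it in the OUTPUT format."""
    for rep in REPS:
        match = rep_to_regex(rep).fullmatch(text)
        if match:
            return "".join(
                match.group(item) if item in ELEMENTS else item
                for item in OUTPUT
            )
    return None  # no representation matched


for raw in ["(408) 927-3999", "408-927-3999", "4089273999"]:
    print(raw, "->", normalize(raw))
```

Whichever representation matches, the elements are re-emitted through the single output template, so downstream code sees one format only.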

NAMING OF ELEMENTS 

Since all fields are named and elements inside a field are also named, each element in the form is uniquely identified
by the following: field_name.element_name. For example, using the above definition of the composite type "phone", one
may define a field containing the home phone number as:

FIELD home_phone, phone ...
Then, home_phone.area uniquely identifies the home area code.

DICTIONARIES AND ROUTINES 

The definition of an elementary type, as shown above, may specify the name of a routine or dictionary to improve
recognition. Routines are incorporated in the system from a system library or a user library. Their definition in C is always:

int routine_name (p_in, p_out)
where p_in points to the input information (a list of data elements, confidence, number of choices to be returned) and
p_out points to the output information (choices and confidences). The integer returned is 1 if there is a solution, 0
otherwise. Suppose, for example, that the validity of a number can be checked by invoking a routine rtn1. The ROUTINE
clause in the ELEM_TYPE statement will be:

ROUTINE rtn1
A routine may also modify the received data element.

Dictionaries are defined using a DICTIONARY statement with the following format:

DICTIONARY dict_name (field_name_1) IN file_name
The argument of the IN portion of the DICTIONARY record identifies a location (such as a subdirectory) where the
dictionary file resides. Here is an example of a DICTIONARY record, for a dictionary of first names:

DICTIONARY fstname (name) IN /dir/name1.dict
In this example, the dictionary would have the following format:

John
Peter
Mary

The format may also include an extra column containing the frequency of each entry, i.e., the frequency of occurrence
of each of the first names in the dictionary.

The TYPE definition for a first_name would then refer to this dictionary as follows:

TYPE first_name WORD, .... DICT (fstname)
In this example, the dictionary context condition is associated with an elementary type. That is, first names are an
elementary type because they do not come in a plurality of different formats (except, of course, for combinations of upper
and lower case characters). However, dictionaries and routines can also be applied to a collection of elements. For
example, suppose we have a dictionary with multiple columns, expressed in the following general format:
DICTIONARY dict_name (field_name_1, field_name_2, ... ) IN file_name

The following is a concrete example of a dictionary containing address information that can be used in connection
with scans of mailing addresses:

DICTIONARY address (state, ZIP, city) IN /dir1/dir2/my_addr.text
Items listed in the address dictionary would have the following format:
CA 95120 SAN-JOSE
NY 10010 NEW-YORK

Suppose now that three fields, designated w7, w8, and w9, are defined for state, ZIP code, and city. Then one
can associate with the last field, w9, a CHECK statement, as follows:

CHECK w9 address (w7.state, w8.ZIP, w9.city)
that will enforce the constraint that the submitted arguments constitute a triple that is in the dictionary "address". That
is, if an OCR scan of a text string identifies a city by name in part of the string, then any state and ZIP code information
[...] the string must match one of the listings in the [...] include the identified city
and any different valid states and ZIP codes.
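A CHECK against a multi-column dictionary amounts to testing whether some combination of per-field choices appears as a row. The Python sketch below assumes whitespace-separated rows like the two address entries shown earlier. `load_dictionary` and `check` are hypothetical helpers, and the candidate lists are invented for the example.

```python
from itertools import product


def load_dictionary(lines, columns):
    """Parse whitespace-separated rows into a set of tuples, one per entry."""
    rows = set()
    for line in lines:
        parts = line.split()
        if len(parts) == len(columns):  # skip malformed rows
            rows.add(tuple(parts))
    return rows


def check(dictionary, candidates):
    """Return every combination (one choice per field) found in the dictionary."""
    return [combo for combo in product(*candidates) if combo in dictionary]


address = load_dictionary(
    ["CA 95120 SAN-JOSE", "NY 10010 NEW-YORK"],
    ("state", "ZIP", "city"),
)

# Invented OCR choices for fields w7 (state), w8 (ZIP) and w9 (city).
w7 = ["CA", "GA"]
w8 = ["95128", "95120"]
w9 = ["SAN-JOSE"]

print(check(address, [w7, w8, w9]))
```

Only the triple that exists in the dictionary survives, even though neither the state nor the ZIP code was unambiguous on its own.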

THE CONTEXT ANALYZER 

[...] returned by the recognition engine according to the constraints imposed by the DSL program. The overall problem
may be couched in terms of a mathematical programming
problem. Interpreting the complete document consists of receiving a set of choices provided by the OCR system from
[...] character [...] and/or a certain syntactic or semantic alternative in the
DSL specification. Any OCR choice has a certain probability, and therefore a solution (an acceptable sequence of
decisions) has itself a probability obtained by combining the elementary probabilities of the choices.

If all possible combinations of choices are considered as possible solutions, such a method leads to a combinatorial
explosion of different combinations. However, in accordance with the invention, special techniques are used to limit the
explosion to a manageable number of combinations, and to increase performance.

To explain the overall functioning of the system, a simple execution algorithm which relies on controlled enumeration
is used. The implementation will make sure that techniques such as branch and bound are used everywhere possible.
As an example, let us consider the common numerical way of expressing the calendar date. The date February 5th,
1994 may be expressed as 2/5/94 or 2-5-94 (among others). Let us assume the following DSL specification:

ELEM_TYPE smalln WORD, NUM, LENGTH<=2 

TYPE date (mm( smalln) , dd(smalln), yy(smalln)) 

REP (mm "/" dd "/" yy ) 

REP (mm "-" dd "-" yy) 

OUTPUT (mm "/" dd "/" yy) 
FIELD w6, date, . . . 



together with a set of OCR results, which are shown in FIG. 2. In this DSL specification, the two REP statements define
two representations of the date, which differ in that the delimiters are slashes and dashes, respectively.

The context analyzer processes the OCR results in two phases: Phase 1 handles syntactic constraints; Phase 2 
takes care of semantic constraints. 

PHASE 1: SYNTACTIC CONSTRAINTS 

[...] the algorithm utilizes syntactic constraints by essentially enumerating the (generally small number
of) syntax models for a particular field. The following steps are used:

1. Determine the character types that are relevant to the syntax definition of the field.

In the present example, this essentially involves distinguishing between numeric characters (or connected
components of more than one numeric character) and delimiter characters, and determining which of the two delimiter
characters (dashes and slashes) have been recognized.

2. Convert the character hypotheses returned by the recognition engine into the corresponding type hypotheses.

[...] are numeric, dash, and slash. Then, for each connected
component (multiple-digit dates, months, and years), the type hypotheses can be derived from the character hypotheses.

For example, FIG. 3 is a table of possible types which were hypothesized from the OCR results in FIG. 2. The
dashes and slashes are as shown, and all numerics are represented by the letter n. In the right-most column of FIG.
2, the three numerics 9, 4, and 2 are possible choices. They are all single-digit numerics, and thus are represented
by the single entry of n (representing a single-digit numeric) in the right-most column of FIG. 3. In the left-most
column of FIG. 2, three possible choices were recognized, the two numerics 2 and 7, and the two-character sequence
of 9-. Two entries are in the left-most column of FIG. 3: one with a single numeric digit n, representing the 2 and 7,
and a numeric followed by a dash, representing the 9-.

3. Enumerate the possible models, starting with the most probable types.

In the example, the analysis yields a list of possible syntactic models that could match the OCR results while
satisfying the syntactic constraints in the DSL specification. In this example, it happens that there is one possibly
matching model for each of the two representations defined in the DSL specification, above.

For the first representation, the model is n/n/nn, and for the second representation, the model is n-nn-nn. Each
of the two models is made possible because of the occurrence of dashes and slashes in the scanned results of
FIG. 2 and the possible types of FIG. 3. Conversely, if there were, for instance, no slash along with the 9 at the top
of the fourth column of FIG. 2, then the model for the first representation would not be available based on the
characters as they exist in the OCR results. It would be possible, however, to go into another level of context analysis
by interpreting the 1 in the 10, immediately below the /9 in the fourth column of FIG. 2, as a slash rather than a 1.
This interpretation would then revive the model for the first representation.

The first representation is deemed more probable based on the probabilities given along with the individual
OCR results which are consistent with the respective models.

4. Replace the types by the actual values. This operation is straightforward. FIG. 4 shows a set of actual character
choices for the first syntax. For instance, the month digit was recognized as either a 2 or a 7, with 2 having a higher
recognition confidence. The probabilities can be accumulated along the paths. The full set of solutions would yield
2/5/99, 2/5/94, 2/5/92, 7/5/99, 7/5/94, and 7/5/92. However, if we assume that we have no other a priori information
on the dates, the best choice is obtained by simply picking the best choice for each character, according to the
confidence levels expressed in parentheses in FIG. 2, yielding 2/5/99. The same is done for the other valid syntaxes;
global probabilities are used to choose the optimum.
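Steps 2 through 4 can be sketched in Python as follows. The `OCR` table is a hypothetical stand-in for FIG. 2: the character choices follow the ones mentioned in the text, but the layout of the middle columns and all confidence values are invented. The sketch uses plain enumeration, without the branch and bound techniques mentioned earlier, and combines confidences by multiplication, which is one possible way of "combining the elementary probabilities".

```python
import re
from itertools import product
from math import prod

# Hypothetical stand-in for FIG. 2: one list of (string, confidence)
# choices per connected component. Confidences are invented.
OCR = [
    [("2", 0.9), ("7", 0.6), ("9-", 0.3)],
    [("/", 0.8), ("1", 0.4)],
    [("5", 0.9), ("3-", 0.3)],
    [("/9", 0.8), ("10", 0.4), ("6", 0.3)],
    [("9", 0.7), ("4", 0.6), ("2", 0.3)],
]

# One pattern over type strings per REP; smalln allows one or two numerics.
N = "n{1,2}"
REPS = {
    "slash": re.compile(f"{N}/{N}/{N}"),
    "dash": re.compile(f"{N}-{N}-{N}"),
}


def to_type(chars):
    """Step 2: map a character hypothesis to its type string."""
    return "".join("n" if c.isdigit() else c for c in chars)


def type_table(ocr):
    """Group each column's choices by type (the analogue of FIG. 3)."""
    table = []
    for column in ocr:
        by_type = {}
        for chars, conf in column:
            by_type.setdefault(to_type(chars), []).append((chars, conf))
        table.append(by_type)
    return table


def models(table):
    """Step 3: enumerate type sequences that match some representation."""
    found = []
    for combo in product(*(col.keys() for col in table)):
        model = "".join(combo)
        for rep, pattern in REPS.items():
            if pattern.fullmatch(model):
                found.append((rep, combo))
    return found


def solutions(table, combo):
    """Step 4: substitute characters for types, best probability first."""
    choices = [table[i][t] for i, t in enumerate(combo)]
    out = []
    for path in product(*choices):
        text = "".join(chars for chars, _ in path)
        out.append((text, prod(conf for _, conf in path)))
    return sorted(out, key=lambda s: s[1], reverse=True)


table = type_table(OCR)
for rep, combo in models(table):
    best_text, best_prob = solutions(table, combo)[0]
    print(rep, "".join(combo), best_text, round(best_prob, 4))
```

Working on types first keeps the number of candidate models small; characters are only substituted for the few models that survive the syntax check.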

However, if more a priori knowledge on the semantics of dates exists, Phase 2 will exploit it. 


PHASE 2: SEMANTIC CONSTRAINTS 

Let us suppose that the a priori knowledge is embedded in a year routine that returns a Boolean value:
1 if the year is valid and 0 if the year is invalid. Then, the DSL program is modified to include the semantic check, as follows:

ELEM_TYPE smalln WORD, NUM, LENGTH<=2

ELEM_TYPE year WORD, NUM, LENGTH<=2, "year"

TYPE date (mm(smalln), dd(smalln), yy(year))

REP (mm "/" dd "/" yy)

REP (mm "-" dd "-" yy)

OUTPUT (mm "/" dd "/" yy)

FIELD w6, date, ...

Phase 2 is invoked as soon as a result for a single element (such as yy) is identified in step 4 of the process,
discussed above. If the element is associated with a semantic constraint (such as "year"), the constraint is checked. If the
constraint is satisfied, the value is accepted. Otherwise the process continues to find the best solution.

When the current representation has been matched, the results and their overall confidence levels are stored in the
object buffer. When all representations have been processed, the best solution of all those accumulated is picked. In
the example, the process unfolds as follows:

for the first rep: first and only syntax: n/n/nn

replacement by characters: 2/5/99

semantic check: 99 fails (assume the check is for past dates)

next hypothesis: 94
semantic check: OK. Accept 2/5/94
store this hypothesis in the object buffer.

for the second rep: similar process. Accept 9-13-69
Pick up best.
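The trace above can be mimicked with a short Python sketch. It is a simplification: the process described works element by element with backtracking, whereas this version walks whole candidate strings in decreasing order of confidence. The `year_is_valid` routine, its cutoff year of 94, and the confidence values are assumptions made for the example.

```python
def year_is_valid(yy, current_yy=94):
    """Hypothetical 'year' routine: 1 for a past or current year, else 0."""
    return 1 if int(yy) <= current_yy else 0


def first_valid(candidates, check):
    """Try candidates from most to least probable; accept the first valid one."""
    for text, conf in sorted(candidates, key=lambda c: c[1], reverse=True):
        # The year is the last element in both representations.
        yy = text.replace("-", "/").split("/")[-1]
        if check(yy):
            return text, conf
    return None


# Invented candidate lists, one per representation, with overall confidences.
slash_rep = [("2/5/99", 0.36), ("2/5/94", 0.31), ("7/5/99", 0.24)]
dash_rep = [("9-13-69", 0.003)]

object_buffer = []
for candidates in (slash_rep, dash_rep):
    accepted = first_valid(candidates, year_is_valid)
    if accepted is not None:
        object_buffer.append(accepted)  # store this hypothesis

best = max(object_buffer, key=lambda c: c[1])  # pick up best
print(best)
```

The most probable string, 2/5/99, is rejected by the semantic check, so the next hypothesis is accepted and later wins the comparison across representations.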

GENERALIZATION 

[...] on the date field (February 5, 1994) only. Now more general cases will be considered.
Essentially, the generalization goes in two directions:

1. Extension to handle dictionaries and routines that are not boolean.

2. Extension to multi-phase checking.

NON-BOOLEAN CHECKING

Until now we have only seen how boolean routines (returning 1 or 0) are exploited. In fact, the process explained
above generates solutions made out of combinations of letters that exist in the OCR results.
[...] non-boolean routines are considered which, from the OCR values, generate results by computation
[...] not in the OCR [...]. The use of a [...] enters into [...]
Consider, for example, a form that is only used in the State of California, and which contains a field city. Assume [...]
be passed to a [...] search routine that will find the best matching values in a dictionary.
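A non-boolean routine of this kind can be sketched as a search that scores dictionary entries against the OCR choices and returns entries that may not appear in the OCR results at all. The Python sketch below uses `difflib.SequenceMatcher` from the standard library as the similarity measure. That choice, the scoring rule (confidence times similarity), the function name `fuzzy_search`, and the sample data are all assumptions, since the text does not specify how the search is performed.

```python
from difflib import SequenceMatcher


def fuzzy_search(ocr_choices, dictionary, n=3):
    """Rank dictionary entries by OCR confidence times string similarity."""
    scored = {}
    for text, conf in ocr_choices:
        for entry in dictionary:
            score = conf * SequenceMatcher(None, text, entry).ratio()
            # Keep the best score seen for each dictionary entry.
            scored[entry] = max(score, scored.get(entry, 0.0))
    ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]  # n best (entry, score) pairs


# Invented data: two OCR readings of a city field and a tiny city dictionary.
ocr_choices = [("FRESN0", 0.6), ("ERESNO", 0.4)]
cities = ["FRESNO", "FREMONT", "SAN-JOSE"]

for entry, score in fuzzy_search(ocr_choices, cities):
    print(entry, round(score, 3))
```

Neither OCR reading is a valid city name, yet the routine proposes FRESNO as the best candidate, which is what distinguishes it from a boolean check.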

MULTI-STAGE CHECKING 

We now extend the example to a field [...]. It is clear that several
semantic constraints are relevant: (1) Fresno must be a valid city name, (2) 93650 must be a valid ZIP code, and (3) last
but not least, 93650 must be a ZIP code in Fresno. That is where multi-stage checking comes in.
One possible solution is to write, in DSL:

ELEM_TYPE ZIP WORD, NUM, LENGTH=5
ELEM_TYPE city PHRASE, my_string

FIELD w2, city, ...
FIELD w4, ZIP, ...

CHECK DICT address (w2.city, w4.ZIP)

[...] as explained before. [...] future semantic checks will happen as a result of
CHECK, the choices of solutions for city can simply accumulate in the object buffer the OCR results. The same will
happen for ZIP. Then the execution of CHECK will execute a fuzzy search that covers both city and [...]

The mechanism provides much flexibility. For example, another reasonable option is to accept only valid city names
during the [...]

[...] alternative methods depend on [...] semantics [...]

[...] flexibility is provided both by the language and the analyzer.



What follows is high-level pseudocode that describes the overall functioning of the context processor.

Context_Processor:
  for each window
    for each representation
      find all syntactic models that satisfy the grammar
          (looking only at types);
      for each syntactic model
        for each data element in the model
          /* replace by actual characters */
          if no semantic, use most probable chars only;
          if semantic is delayed (until CHECK with dict)
            save OCR results in Object buffer (OB);
          if semantic through dict for elem type
            invoke search routine with OCR results;
          if other semantic on individual element
            consider next character combination
                (and stack position);
            if satisfies semantics, accept and go on with
                next data element;
            else backtrack to get next combination;
        After last data element
          save solution in OB
              (keep only the n best where n can be specified);
          backtrack to try next choice;
        When no backtracking left
          loop to handle next syntactic model;
      When no syntactic model left
        loop to handle next representation;
    When last representation, loop to handle next window;
  When last window, end;



CHECK_with_dictionary:

use info stored in OB to find the best dictionary entries and update OB.
For each "column" in the dictionary, the initial info is OCR result choices or set of values (see parsing algorithm
above).

CHECK_with_routine:

always use data elements that have been identified during the parsing.
THE OBJECT BUFFER

[...] the object buffer was described as a simple mechanism to pass values of named variables. However, [...]
information about the values and confidence of the OCR engine. It also
relies on the capability of changing a result or a set of results available for a subsequent stage of checking.
[...] recognition [...] results in the [...] as a specific predetermined structure. The preferred structure [...]

Similarly, the temporary and final results of the context processor, produced as described above, are also stored in
the buffer.

[...] data buffer; they can also update the buffer by changing values, reducing
[...] replace a single value by a set for further checking and selection.

An architecture has been disclosed for context checking in OCR. It is centered on a Document Specification Language
(DSL). [...] languages [...] a long time as a means of communicating [...] to
help formatting. Here, DSL is geared towards helping recognition.

The system organization allows for sharing many functions among applications: 

1. it factors out the central mechanism of choosing among choices, 

2. it provides a uniform mechanism for invoking semantic routines, and 

3. it provides a unified mechanism for interchanging the kind of data involved in the recognition process. 

Once such a framework is used, it is expected to trigger the development of some library of DSL functions very
[...] programming language. The results will be more [...] because [...]

[...] processing some tax forms, credit applications and other forms. But, very quickly, the
[...] and an implementation of an operational system was made based on
a somewhat simplified version of the proposal, in which process optimization was disregarded. This simplification did
not adversely impact the ability to handle the above applications, which were the primary objective of the [...]

[...] from the implementation of the invention. Note that the recognition
rates are at the field level, not the character level:

telephone number (10 digits) 

syntactic checking ONLY: 

field recognition rate increases from 28% to 44% 

single word dictionary lookup 

(last name, incomplete dictionary) 

recognition rate increases from 9% to 27% 

multi-field dictionary lookup 

(city, state, ZIP) 

recognition rate increases: 

from 8% to 60% for city 

from 18% to 64% for state 

from 28% to 50% for ZIP code 

[...] produced results, the quality of the [...] was poor, so that the OCR engine
would [...] recognition [...]. The [...], however, is the efficiency of the context
analyzer, in terms of the increase in percentages given above.

Claims 

1 . An optical character recognition (OCR) system (2) comprising: 

means for producing a scan of an input image (4) of text to be recognized; and

[...] predetermined [...]

2. An OCR system as recited in claim 1, wherein:

the predetermined text content constraint includes a syntactical constraint and a semantic constraint; and
the OCR system further comprises:
syntax means for checking the preliminary scan for consistency with a syntactical constraint, and
semantics means for checking the preliminary scan for consistency with a semantic constraint.

3. An OCR system as recited in claim 2, wherein the semantics means is operative responsive to completion of
operation of the syntax means.

4. An OCR system as recited in claim 2 or 3, wherein the syntax means and the semantic means each include means 
for interpreting text content constraints in terms of a user-programmable document specification language. 

5. An OCR system as recited in claim 4, wherein: 

the syntax means includes: 

(i) means for receiving a document description (6), programmed in the document specification language,
pertaining to the text to be recognized, and

(ii) means for compiling the document description to produce a context structure (10); and 

the context analyzer (12) includes means for checking the preliminary scan for consistency with the 
context structure (10). 

6. An OCR system as recited in any claim from 2 to 5, wherein: 

the semantics means includes a library of semantic routines (18); and 

the context analyzer (12) includes means for checking the preliminary scan for consistency with the semantic
routines (18). 

7. An OCR system as recited in claim 6, further comprising means for facilitating modification of the library of semantic 
routines (18) by a user of the OCR system. 

8. An OCR system as recited in any claim from 4 to 7, wherein the document specification language includes
instructions for defining at least one of:

(i) a field within the text to be recognized,

(ii) a character type for characters occurring within the field,
(iii) an alphabet, and

(iv) a representation of a sequence of characters in terms of types of the characters of the sequence. 

9. An OCR system as recited in any claim from 1 to 8, wherein: 

the OCR system further comprises a dictionary containing a set of valid text items; and 
the context analyzer (12) includes means for performing a fuzzy search through the dictionary to identify best 
matching values among the text items. 

10. An OCR system as recited in any claim from 1 to 8, wherein: 

the OCR system includes a plurality of dictionaries containing sets of valid text items for respective fields of 
the text to be recognized; and 

the context analyzer (12) includes: 

(i) means for performing respective fuzzy searches through the dictionaries to identify best matching values, 
among the text items, for respective fields of the text to be recognized, and 

(ii) means for comparing the best matching values of the respective fuzzy searches to identify a best combination 
of the best matching values. 

11. An OCR system as recited in any claim from 1 to 10, further comprising an object oriented buffer (16) coupled to 
the context analyzer (12) for passing values of variables. 

12. An OCR system as recited in any claim from 2 to 10, further comprising: 

a recognition engine (14) coupled to the context analyzer (12) for performing an initial character recognition 
procedure on the image; 

an object oriented buffer (16) coupled to the recognition engine (14) for receiving and storing results of the 
initial character recognition procedure in a predetermined structure, for storing results of the context analyzer (12), 
and for providing updatable data to the semantic means.
13. A method for performing [...], the method
comprising the steps of:

providing syntax definitions of an expected content of the field of the image; 

determining character types that are relevant to the syntax definitions of the field; 

operating a recognition engine on the image to produce character hypotheses of a content of the image and 
probability values for the character hypotheses; 

converting the character hypotheses into a character type hypotheses; 

enumerating possible models for the image content based on the character type hypotheses and on the 
probability values; 

replacing the character type hypotheses with character values to produce a set of solutions; and 
selecting one of the solutions as the recognized text. 



14. A computer program product, for use with a processing system, comprising:

a recording medium;

means, recorded on the recording medium, for directing the processing system to operate the method according
to claim 13.